Features based text similarity detection
As the Internet helps us cross cultural borders by providing access to diverse information, plagiarism is bound to arise. As a result, plagiarism detection is increasingly in demand. Different plagiarism detection tools have been developed based on various detection techniques. Nowadays, the fingerprint matching technique plays an important role in those tools. However, when handling large articles, fingerprint matching has weaknesses, especially in space and time consumption. In this paper, we propose a new approach to plagiarism detection that integrates fingerprint matching with four key features to assist the detection process. The proposed features are capable of selecting the main points or key sentences in the articles to be compared. The selected sentences then undergo the fingerprint matching process in order to detect the similarity between them. Hence, time and space usage for the comparison process is reduced without affecting the effectiveness of the plagiarism detection.
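To illustrate the general idea of fingerprint matching between two selected sentences, here is a minimal sketch. The abstract does not specify the fingerprinting scheme or the four selection features, so everything below is an assumption made for illustration: character k-grams hashed with CRC32, Jaccard overlap as the similarity score, and the hypothetical names `fingerprints` and `similarity`.

```python
import zlib


def fingerprints(sentence, k=5):
    """Return a set of hashed character k-grams (an assumed fingerprint scheme)."""
    # Normalise: keep only letters and digits, lower-cased.
    text = "".join(ch.lower() for ch in sentence if ch.isalnum())
    if not text:
        return set()
    if len(text) < k:
        # Too short for a full k-gram: hash the whole string once.
        return {zlib.crc32(text.encode("utf-8"))}
    return {
        zlib.crc32(text[i:i + k].encode("utf-8"))
        for i in range(len(text) - k + 1)
    }


def similarity(a, b, k=5):
    """Jaccard overlap of the two fingerprint sets, in the range [0, 1]."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Comparing only a handful of pre-selected sentences, rather than every sentence pair, is what would keep the number of fingerprint sets small; how those sentences are chosen is left open here because the abstract does not describe it.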
Research Collaboration Analysis Using Text and Graph Features
Patterns of scientific collaboration and their effect on scientific production have been the subject of many studies. In this paper we analyze the nature of ties between co-authors and study collaboration patterns in science from the perspective of the semantic similarity of authors who wrote a paper together and the strength of ties between these authors (i.e. how much they have previously collaborated). These two views of scientific collaboration are used to analyze publications in the TrueImpactDataset [11], a new dataset containing two types of publications: publications regarded as seminal and publications regarded as literature reviews by field experts. We show there are distinct differences between seminal publications and literature reviews in terms of author similarity and the strength of ties between their authors. In particular, we find that seminal publications tend to be written by authors who have previously worked on dissimilar problems (i.e. authors from different fields or even disciplines), and by authors who are not frequent collaborators. On the other hand, literature reviews in our dataset tend to be the result of an established collaboration within a discipline. This demonstrates that our method provides meaningful information about the potential future impact of a publication without requiring citation information.
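One simple way to make "strength of ties" concrete is to count how many earlier papers each pair of authors wrote together. The sketch below assumes exactly that definition, which the abstract does not confirm; the function names and the input format (a list of author lists) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def tie_strengths(prior_papers):
    """Count earlier joint papers for every author pair (an assumed definition)."""
    strengths = Counter()
    for authors in prior_papers:
        # Sort so that each pair always has the same key order.
        for pair in combinations(sorted(set(authors)), 2):
            strengths[pair] += 1
    return strengths


def mean_tie_strength(authors, strengths):
    """Average prior-collaboration count over all author pairs of one paper."""
    pairs = list(combinations(sorted(set(authors)), 2))
    if not pairs:
        return 0.0
    return sum(strengths[pair] for pair in pairs) / len(pairs)
```

Under this reading, a paper whose authors have rarely worked together before gets a low mean tie strength, which is the pattern the abstract reports for seminal publications.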
Non-Standard Words as Features for Text Categorization
This paper presents categorization of Croatian texts using Non-Standard Words
(NSW) as features. Non-Standard Words are: numbers, dates, acronyms,
abbreviations, currency, etc. NSWs in the Croatian language are determined
according to Croatian NSW taxonomy. For the purpose of this research, 390 text
documents were collected and formed the SKIPEZ collection with 6 classes:
official, literary, informative, popular, educational and scientific. A text
categorization experiment was conducted on three different representations of
the SKIPEZ collection: in the first representation, the frequencies of NSWs are
used as features; in the second representation, the statistical measures of NSWs
(variance, coefficient of variation, standard deviation, etc.) are used as
features; while the third representation combines the first two feature sets.
Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms
were used in text categorization experiments. The best categorization results
are achieved using the first feature set (NSW frequencies) with the
categorization accuracy of 87%. This suggests that the NSWs should be
considered as features in highly inflectional languages, such as Croatian. NSW
based features reduce the dimensionality of the feature space without standard
lemmatization procedures, and therefore the bag-of-NSWs should be considered
for further Croatian text categorization experiments.
Comment: IEEE 37th International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1415-1419,
201
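As a rough illustration of the first representation (NSW frequencies as features), the sketch below counts a few NSW classes with regular expressions. The patterns are simplified assumptions for demonstration only and do not reproduce the Croatian NSW taxonomy used in the paper; `NSW_PATTERNS` and `nsw_counts` are hypothetical names.

```python
import re

# Assumed, simplified patterns; order matters because earlier classes
# are removed from the text before later ones are counted.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
    "currency": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:kn|EUR|USD)\b"),
    "acronym": re.compile(r"\b[A-Z]{2,}\b"),
    "number": re.compile(r"\b\d+(?:[.,]\d+)?\b"),
}


def nsw_counts(text):
    """Return how often each assumed NSW class occurs in the text."""
    counts = {}
    remaining = text
    for label, pattern in NSW_PATTERNS.items():
        counts[label] = len(pattern.findall(remaining))
        # Blank out what was matched so it is not counted again
        # by a more general pattern (e.g. a date counted as numbers).
        remaining = pattern.sub(" ", remaining)
    return counts
```

The resulting dictionary has one entry per NSW class, so each document maps to a short, fixed-length feature vector regardless of vocabulary size, which is consistent with the dimensionality reduction the abstract mentions.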
TRECVid 2006 experiments at Dublin City University
In this paper we describe our retrieval system and experiments performed for the automatic search task in TRECVid 2006. We submitted the following six automatic runs:
• F A 1 DCU-Base 6: Baseline run using only ASR/MT text features.
• F A 2 DCU-TextVisual 2: Run using text and visual features.
• F A 2 DCU-TextVisMotion 5: Run using text, visual, and motion features.
• F B 2 DCU-Visual-LSCOM 3: Text and visual features combined with concept detectors.
• F B 2 DCU-LSCOM-Filters 4: Text, visual, and motion features with concept detectors.
• F B 2 DCU-LSCOM-2 1: Text, visual, motion, and concept detectors with negative concepts.
The experiments were designed both to study the addition of motion features and separately constructed models for semantic concepts to runs using only textual and visual features, and to establish a baseline for the manually-assisted search runs performed within the collaborative K-Space project and described in the corresponding TRECVid 2006 notebook paper. The results of the experiments indicate that the performance of automatic search can be improved with suitable concept models. This, however, is very topic-dependent, and the questions of when to include such models and which concept models should be included remain unanswered. Secondly, using motion features did not lead to performance improvement in our experiments. Finally, it was observed that our text features, despite displaying rather poor performance overall, may still be useful even for generic search topics.
Strong correlations between text quality and complex networks features
Concepts of complex networks have been used to obtain metrics that were
correlated to text quality established by scores assigned by human judges.
Texts produced by high-school students in Portuguese were represented as
scale-free networks (word adjacency model), from which typical network features
such as the in/outdegree, clustering coefficient and shortest path were
obtained. Another metric was derived from the dynamics of the network growth,
based on the variation of the number of connected components. The scores
assigned by the human judges according to three text quality criteria
(coherence and cohesion, adherence to standard writing conventions and theme
adequacy/development) were correlated with the network measurements. Text
quality for all three criteria was found to decrease with increasing average
values of outdegrees, clustering coefficient and deviation from the dynamics of
network growth. Among the criteria employed, cohesion and coherence showed the
strongest correlation, which probably indicates that the network measurements
are able to capture how the text is developed in terms of the concepts
represented by the nodes in the networks. Though based on a particular set of
texts and specific language, the results presented here point to potential
applications in other instances of text analysis.
Comment: 8 pages, 8 figure
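The word adjacency model mentioned in the abstract can be sketched as a directed graph that links each word to the word that follows it. The example below is an illustration under stated assumptions, not the authors' pipeline: tokenisation is a plain whitespace split with no stop-word removal or lemmatisation, clustering is computed on the undirected version of the graph, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def word_adjacency(text):
    """Map each word to the set of distinct words that directly follow it."""
    words = text.lower().split()
    successors = defaultdict(set)
    for word in words:
        successors[word]  # make sure every word is a node, even with no edges
    for current, following in zip(words, words[1:]):
        if current != following:  # ignore self-loops
            successors[current].add(following)
    return dict(successors)


def average_outdegree(graph):
    """Mean number of distinct successors per node."""
    if not graph:
        return 0.0
    return sum(len(succ) for succ in graph.values()) / len(graph)


def average_clustering(graph):
    """Mean local clustering coefficient, ignoring edge direction."""
    neighbours = {node: set() for node in graph}
    for node, succ in graph.items():
        for other in succ:
            neighbours[node].add(other)
            neighbours[other].add(node)
    if not neighbours:
        return 0.0
    total = 0.0
    for node, adjacent in neighbours.items():
        k = len(adjacent)
        if k < 2:
            continue  # coefficient is taken as 0 for nodes with < 2 neighbours
        links = sum(1 for u, v in combinations(adjacent, 2) if v in neighbours[u])
        total += 2 * links / (k * (k - 1))
    return total / len(neighbours)
```

Measurements like these could then be correlated with the human-assigned scores; the growth-dynamics metric based on connected components is not covered by this sketch.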